I’m working on evaluating open-source LLMs (e.g., Phi, Llama, Qwen), and I’ve noticed that the benchmark scores I get consistently differ from the ones reported in their tech reports or papers, sometimes by a wide margin.
Sometimes my results are lower than expected, but surprisingly, sometimes they’re higher. In many cases the gap is quite large, and it’s not clear why.
I’ve tried:
- Using lm-eval-harness with the default settings (see the sketch after this list for the kind of run I mean)
- Matching tokenizers and prompt formats as best as possible
- Evaluating on the standard benchmarks cited in the reports (MMLU, GSM8K, ARC, etc.) under the same few-shot conditions
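For concreteness, here’s a minimal sketch of my typical run using the harness’s Python API. The model name, task list, and shot count are just placeholders (not the setup from any particular paper), and exact arguments may vary with your lm-eval version:

```python
# Minimal sketch of an lm-eval-harness run via its Python API.
# Model, tasks, and few-shot count are illustrative placeholders.
import json
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16",
    tasks=["mmlu", "gsm8k", "arc_challenge"],
    num_fewshot=5,   # applied to every task listed; per-task shot counts
    batch_size=8,    # would need separate runs
)

# Keep the scores *and* the harness config together, since prompt format,
# few-shot count, and task version are the usual sources of mismatch when
# comparing against published numbers. (For instruct models, whether a chat
# template is applied also matters; newer harness versions expose an option
# for that.)
print(json.dumps(results["results"], indent=2))
print(json.dumps(results["config"], indent=2, default=str))
```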
Despite this, the scores I get are often significantly different from what’s published — and I can’t find any official scripts or clear explanations of the exact benchmarking setup used in those papers.
This seems to happen not just with one model, but across many open-source models.
Is this a common experience in the community?
- Are papers using special prompt engineering or internal eval setups they don’t release?
- Am I missing some key benchmarking tricks?
- Is this just part of the game at this point?
I’d really appreciate it if anyone could share:
- Experience trying to reproduce scores
- Any evaluation tips
- Benchmarking setups that actually match reported numbers
Thanks in advance!